feat(yp): faster typescript serialization #23713
Open
AztecBot wants to merge 10 commits into
cfdb8a1 to 080c41d
…path
Replace the recursive Tx.toBuffer() chain (Buffer alloc at every node,
Buffer.concat at every level) with a single growable ArrayBuffer the whole
object graph streams into and that is sliced once at the root.
The migration contract is the optional-sink overload:
toBuffer(): Buffer;
toBuffer(sink: BufferSink): void;
Pass a sink and it writes + returns undefined; omit it and it returns its
own buffer. Unmigrated children fall back via return value, so it lands
incrementally and existing toBuffer() callers keep working.
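The contract above can be sketched as follows. This is a minimal illustration, not the PR's code: `SketchSink`, `Leaf`, `LegacyLeaf` and `Parent` are hypothetical names, and `Uint8Array` stands in for Node's `Buffer` so the snippet has no dependencies. It shows a migrated node writing into the sink, and an unmigrated child being folded in through its return value.

```typescript
// Hypothetical minimal sink: a growable byte array (not the PR's BufferSink).
class SketchSink {
  private bytes = new Uint8Array(16);
  private length = 0;

  // Append raw bytes, doubling capacity when they do not fit.
  writeBytes(src: Uint8Array): void {
    const needed = this.length + src.length;
    if (needed > this.bytes.length) {
      let capacity = this.bytes.length;
      while (capacity < needed) capacity *= 2;
      const grown = new Uint8Array(capacity);
      grown.set(this.bytes.subarray(0, this.length));
      this.bytes = grown;
    }
    this.bytes.set(src, this.length);
    this.length += src.length;
  }

  // Copy out the written prefix; intended to be called once at the root.
  finish(): Uint8Array {
    return this.bytes.slice(0, this.length);
  }
}

// A migrated leaf: implements both overloads.
class Leaf {
  constructor(private readonly value: number) {}

  toBuffer(): Uint8Array;
  toBuffer(sink: SketchSink): void;
  toBuffer(sink?: SketchSink): Uint8Array | void {
    const encoded = Uint8Array.of(this.value & 0xff);
    if (sink) {
      sink.writeBytes(encoded);
      return undefined;
    }
    return encoded;
  }
}

// An unmigrated child: only the legacy no-argument form exists.
class LegacyLeaf {
  constructor(private readonly value: number) {}

  toBuffer(): Uint8Array {
    return Uint8Array.of(this.value & 0xff);
  }
}

class Parent {
  constructor(private readonly a: Leaf, private readonly b: LegacyLeaf) {}

  toBuffer(): Uint8Array;
  toBuffer(sink: SketchSink): void;
  toBuffer(sink?: SketchSink): Uint8Array | void {
    const target = sink ?? new SketchSink();
    this.a.toBuffer(target);              // migrated child writes into the sink
    target.writeBytes(this.b.toBuffer()); // legacy child falls back via return value
    return sink ? undefined : target.finish();
  }
}
```

Calling `new Parent(...).toBuffer()` allocates one sink at the root and slices it once, while `toBuffer(sink)` appends to a sink owned by the caller.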
Converts the Tx spine end-to-end: Tx/TxArray, TxHash/TxHashArray,
PrivateKernelTailCircuitPublicInputs (+partials), PrivateToRollupAccumulatedData,
ChonkProof/ChonkProofWithPublicInputs, HashedValues, Vector, BaseField.
BufferSink.writeBigInt uses 4x DataView.setBigUint64 limbs for 32-byte
fields (no hex round-trip, no per-field alloc). On a modeled rollup Tx
(~2660 fields) byte-identical to today and ~11x faster end-to-end; the
naive per-byte shift loop is actually slower than legacy, so picking the
right field encoder is the win.
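A rough sketch of the limb encoder described above, assuming big-endian output and a caller-supplied `DataView`. The function name `writeField32` and the range check are illustrative assumptions; the PR's actual `BufferSink.writeBigInt` may differ in detail.

```typescript
// Write a non-negative bigint below 2^256 as 32 big-endian bytes at `offset`,
// using four 64-bit limbs instead of a hex round-trip or a per-byte loop.
function writeField32(view: DataView, offset: number, value: bigint): void {
  if (value < 0n || value >> 256n !== 0n) {
    throw new RangeError("value does not fit in 32 bytes");
  }
  const limbMask = 0xffffffffffffffffn;
  // DataView.setBigUint64 is big-endian unless littleEndian is passed as true.
  view.setBigUint64(offset, (value >> 192n) & limbMask); // most-significant limb
  view.setBigUint64(offset + 8, (value >> 128n) & limbMask);
  view.setBigUint64(offset + 16, (value >> 64n) & limbMask);
  view.setBigUint64(offset + 24, value & limbMask); // least-significant limb
}
```

Only four BigInt shift-and-mask operations are needed per field, whereas a per-byte loop would perform thirty-two.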
Adds toBuffer cases (private + public) to stdlib/src/tx/tx_bench.test.ts
recording per-op microseconds + payload bytes; wired into CI via the
existing bench_cmds entry, dashboard series Tx/{private,public}/toBuffer/*.
fromBuffer/zod path is unchanged and out of scope.
080c41d to f7e7ba2
…r ~9x Tx.toBuffer

Real bench (stdlib/src/tx/tx_bench.test.ts) on this PR's spine-only baseline vs after:
- Tx/private/toBuffer: 1.96 ms -> 0.22 ms (~8.9x)
- Tx/public/toBuffer: 3.11 ms -> 0.34 ms (~9.1x)
- Tx/private/toBufferReusedSink: 1.86 ms -> 0.16 ms (~12x)
- Tx/public/toBufferReusedSink: 3.04 ms -> 0.29 ms (~10.5x)

cpu-prof on the prior code showed serializeToSink dominating ~50% of total time: the rest-args + Array.isArray + Buffer.isBuffer + 5x typeof dispatch ran per element of every nested array, and serializeToSink(sink, ...obj) allocated a fresh rest-args array for each recursion (1632-element spread per ChonkProof, every call).

Changes:
- foundation/serialize/buffer_sink: split dispatch into per-element serializeOneToSink and an inner serializeArrayToSinkInner that recurses with the array reference, no spread. Hot-path objects exposing toBuffer first so Fr/Fq/migrated leaves skip the primitive-typeof chain. serializeArrayToSink uses the same inner.
- foundation/curves/bn254/field: BaseField caches its 32-byte serialized form. The cache is populated eagerly in the constructor when built from a 32-byte Buffer (the deserialization path) and lazily on first toBuffer otherwise. toBuffer returns a defensive Buffer.from copy or writes the cached bytes straight into a sink, with no bigint->bytes round-trip on the hot path. The Buffer ctor copies via new Uint8Array to defend against caller-side mutation; the copy-ctor aliases the cache (it is never mutated post-assignment).
- stdlib/tx/tx: pre-size the BufferSink with the last serialized length so the no-sink fresh-allocation path skips the 1k->64k doubling-growth cost. Hint lives in a module-level WeakMap rather than an instance field so deep-equality assertions on Tx (which compare enumerable own properties) are unaffected.
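The dispatch split described in this commit might look roughly like the sketch below. All names (`SinkLike`, `SinkWritable`, `serializeOne`, `serializeArrayInner`, `serializeAll`) are hypothetical stand-ins for the PR's functions, and only a few item kinds are handled. The point is the shape: objects exposing `toBuffer` are tested first, and nested arrays are passed by reference rather than spread into rest-args.

```typescript
interface SinkLike {
  writeBytes(bytes: Uint8Array): void;
}

interface SinkWritable {
  toBuffer(sink: SinkLike): void;
}

type Item = SinkWritable | Uint8Array | boolean | Item[];

function isSinkWritable(item: Item): item is SinkWritable {
  return typeof item === "object" && typeof (item as SinkWritable).toBuffer === "function";
}

// Per-element dispatch: the hot-path check (objects with toBuffer) comes first,
// so migrated leaves never reach the primitive checks.
function serializeOne(sink: SinkLike, item: Item): void {
  if (isSinkWritable(item)) {
    item.toBuffer(sink);
  } else if (Array.isArray(item)) {
    serializeArrayInner(sink, item); // recurse with the array reference, no spread
  } else if (item instanceof Uint8Array) {
    sink.writeBytes(item);
  } else {
    sink.writeBytes(Uint8Array.of(item ? 1 : 0)); // boolean as a single byte
  }
}

// Inner loop shared by every array entry point; allocates nothing per level.
function serializeArrayInner(sink: SinkLike, items: Item[]): void {
  for (const item of items) {
    serializeOne(sink, item);
  }
}

// Variadic entry point: rest-args are collected once, at the top level only.
function serializeAll(sink: SinkLike, ...items: Item[]): void {
  serializeArrayInner(sink, items);
}
```

With this shape, a large nested field array costs one loop over the existing array instead of a fresh rest-args array per recursion.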
…ench

The previous commit's BaseField byte cache helped the existing steady-state bench (2051 calls per Tx) but adds a 32-byte Uint8Array alloc plus a Buffer.from copy on every cold-path Fr.toBuffer call. The synthetic bench was a misleading measurement since prod typically serializes each Tx once.

Measured impact, dispatch fix + sink presize only (this commit) vs. with the cache:

variant          no cache   with cache
private steady   0.28 ms    0.22 ms     (cache +20%)
private cold     0.31 ms    1.00 ms     (cache -3.2x, real regression)
public steady    0.38 ms    0.34 ms
public cold      0.37 ms    ~1 ms

The dispatch fix + WeakMap sink-size hint already give ~7-10x vs. the spine-only baseline (1.94 ms / 3.22 ms) without any state held on Fr instances, no deserialize-time copies, no extra memory per long-lived Tx in the mempool.

Also adds two cold-start bench cases (one toBuffer per fresh Tx, no warm cache, no sink reuse) so the dashboard tracks the realistic per-tx cost alongside the steady-state numbers, and a future byte-cache attempt can be evaluated honestly.
…hint

Three small adds on top of the dispatch-fix + sink-presize commits, all without the byte-cache tradeoff:
- foundation/serialize/buffer_sink: split a no-width writeField(value) off writeBigInt so V8 can specialize the Fr/Fq 32-byte limb encoder without the wider routine's width branch. Add writeFields(arr) which iterates a flat field-element array inline with no per-element Sinkable dispatch.
- foundation/curves/bn254/field: BaseField.toBuffer(sink) now calls writeField.
- stdlib/proofs/chonk_proof: both proof classes switch the 1632-element field vector from serializeToSink(... this.fields) to sink.writeFields(this.fields), skipping per-Fr serializeOneToSink dispatch on the largest leaf array in a Tx.
- stdlib/tx/tx: fall back to a process-wide largest-seen Tx size when the per-instance WeakMap sink-size hint is missing, so the no-sink fresh-allocation cold path (different Tx every call) also benefits from sink pre-sizing once any Tx in the process has been serialized.

Best-of-3 AVG us/op vs. spine-only baseline (1940 / 3220):

variant          current   spine baseline   speedup
private steady   220       1940             ~8.8x
public steady    325       3220             ~9.9x
private reused   167       1860             ~11.1x
public reused    266       3040             ~11.4x
private cold     ~275      -                -
public cold      ~445      -                -

Cold numbers are inherently noisier (each timed call serializes a different Tx with different field shapes, so V8 inline caches churn) but stay well below the steady baseline.
ludamad
approved these changes
May 31, 2026
Adds buffer_sink.test.ts covering the new BufferSink module: byte-for-byte equivalence of every sink writer against the legacy serializeBigInt/free_funcs, serializeToSink dispatch (mixed/nested/migrated/legacy-node fallback), capacity growth, reset reuse, overflow/negative guards, and a sink->BufferReader round-trip. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The arbitrary-width branch of writeBigInt used a per-byte BigInt shift loop, which benchmarked as the slowest option (slower than the legacy hex round-trip) because each byte allocates a fresh BigInt. Replace it with 64-bit setBigUint64 limbs written from the least-significant tail, plus a <=7-byte leftover head for widths that aren't a multiple of 8. Faster than the legacy path at every width; multiples of 8 (the only widths used in practice: 8/16/32) take the pure-limb path. The width===32 unrolled fast path is retained. Extends the width coverage in buffer_sink.test.ts.
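One way to picture the arbitrary-width path: full 64-bit limbs are written from the least-significant end, then any leftover head of at most 7 bytes. The sketch below is an assumption-laden illustration (`writeBigIntBE` is a hypothetical name, and the head is written byte by byte here, which the commit message does not specify). It is strict about overflow, matching the behaviour described for the new writeBigInt.

```typescript
// Write `value` as exactly `width` big-endian bytes at `offset`.
function writeBigIntBE(view: DataView, offset: number, value: bigint, width: number): void {
  // Strict range check: reject negatives and values wider than `width` bytes.
  if (value < 0n || value >> BigInt(width * 8) !== 0n) {
    throw new RangeError(`value does not fit in ${width} bytes`);
  }
  const limbMask = 0xffffffffffffffffn;
  let rest = value;
  let pos = offset + width;

  // Full 8-byte limbs, starting from the least-significant tail.
  while (pos - offset >= 8) {
    pos -= 8;
    view.setBigUint64(pos, rest & limbMask);
    rest >>= 64n;
  }

  // Leftover head: at most 7 bytes when width is not a multiple of 8.
  while (pos > offset) {
    pos -= 1;
    view.setUint8(pos, Number(rest & 0xffn));
    rest >>= 8n;
  }
}
```

For widths of 8, 16 and 32 the second loop never runs, which corresponds to the pure-limb path mentioned above.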
ludamad
approved these changes
May 31, 2026
fcarreiro
reviewed
May 31, 2026
// Per-instance sink size hint. Held externally (WeakMap) so it does not appear as an enumerable instance
// field, which would otherwise make deep-equality assertions fail when one side has been serialized.
const txSizeHints = new WeakMap<Tx, number>();
Contributor
We should know this number right? Can we hardcode it even if approximate?
Collaborator
embarrassingly, I missed that it snuck this in. Will get a hardcoded number, makes sense
…constant

The per-instance WeakMap + process-wide largest-seen-size heuristic both existed only to pre-size the BufferSink the no-sink Tx.toBuffer() path allocates. The bootstrapped bench measures the actual Tx payloads at:
- private-only: 81763 bytes
- public-with-enqueued-calls: 129128 bytes

A single 131072-byte (128 KiB) presize covers both shapes without any doubling-growth ensure() resize on the cold path, and is the same allocation the WeakMap fast-path made on the steady-state hot path anyway. Removing the hidden state matches Adam's review feedback and brings the bench numbers within noise of the WeakMap version:

variant          weakmap (prev)   constant (this)
private steady   220 us           ~244 us
public steady    325 us           ~351 us
private reused   167 us           ~176 us
public reused    266 us           ~276 us
private cold     ~275 us          ~273 us
public cold      ~445 us          ~427 us

Real-world Txs that exceed 128 KiB keep working: the sink falls back to its standard doubling growth, just paying the existing cost.
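To illustrate why a fixed presize removes the growth cost, here is a small sketch of a doubling sink that counts its reallocations. `PresizedSink`, `PRESIZE_BYTES` and `countReallocations` are hypothetical names; only the 131072-byte constant and the two payload sizes come from the commit message, and the 32-byte chunking is an assumption standing in for field-sized writes.

```typescript
// Hypothetical growable sink that records how often it has to reallocate.
class PresizedSink {
  private bytes: Uint8Array;
  private length = 0;
  reallocations = 0;

  constructor(initialCapacity: number) {
    this.bytes = new Uint8Array(initialCapacity);
  }

  // Make room for `extra` more bytes, doubling capacity until they fit.
  private ensure(extra: number): void {
    const needed = this.length + extra;
    if (needed <= this.bytes.length) return;
    let capacity = Math.max(this.bytes.length, 1);
    while (capacity < needed) capacity *= 2;
    const grown = new Uint8Array(capacity);
    grown.set(this.bytes.subarray(0, this.length));
    this.bytes = grown;
    this.reallocations++;
  }

  writeBytes(src: Uint8Array): void {
    this.ensure(src.length);
    this.bytes.set(src, this.length);
    this.length += src.length;
  }
}

// 128 KiB, the constant named in the commit message.
const PRESIZE_BYTES = 131072;

// Stream roughly `payloadBytes` into a sink in 32-byte writes and report
// how many times the backing array had to be replaced.
function countReallocations(initialCapacity: number, payloadBytes: number): number {
  const sink = new PresizedSink(initialCapacity);
  const chunk = new Uint8Array(32);
  for (let written = 0; written < payloadBytes; written += chunk.length) {
    sink.writeBytes(chunk);
  }
  return sink.reallocations;
}
```

Under these assumptions, both measured payload sizes fit in the presized sink without a single reallocation, while a payload above 128 KiB triggers the ordinary doubling path once.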
…ng value

The (0x0123456789abcdefn, 7) case was 57 bits (high byte 0x01) but width=7 only holds 56 bits. Legacy serializeBigInt silently truncates the high byte; the new writeBigInt is strict and throws to match its 32-byte path. Drop the overflowing high byte so the value fits, keeping the test's stated intent ("matches serializeBigInt byte-for-byte") aligned with both impls. The out-of-range strictness is already covered by the dedicated "rejects out-of-range bigints" block.
Result: 3-4x faster typescript serialization.
Give toBuffer another mode: if a buffer 'sink' is passed, write to that instead of returning a new buffer. Thus a top-level toBuffer() call, instead of allocating a buffer at every node and concatenating at every level, streams the whole object graph into one sink that is sliced once at the root.
This allows us to avoid intermediate buffer allocations. Additionally, we did some trial and error to speed things up around optimizations done by V8, using benchmarks that are hopefully representative.